-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCP/provisioner: Handle the RESOURCE_NOT_FOUND error. #1842
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for fixing this issue @concretevitamin!
sky/backends/cloud_vm_ray_backend.py
Outdated
# https://github.com/skypilot-org/skypilot/issues/1797 | ||
# The VM may be alive on console. In the inner provision | ||
# loop we have used retries to recover but failed. The | ||
# provision loop will terminate the potentially live VMs | ||
# and move onto the next zone. Since the VM may have been | ||
# provisioned in this zone, it doesn't seem right to block | ||
# the current zone. | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have done the retry in _gang_schedule_ray_up
by adding that error in the need_ray_up
, we should add the zone to the block list here, to avoid the outer failover loop go to that zone again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, without -r
the outer loop will do a single pass over all zones, so this zone will not be tried again. So blocking vs. not blocking has the same effect. With -r
, it's desirable to retry this zone. Is this the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our yield zones will not loop through all the zones in a region for spot, as we currently rely on the optimizer to generate optimization result for per-zone. That is to say the failover for zones will happen here, but not here.
However, since we have add the to_provision
in the block list here, it should explain why it does not go to that zone again. I still prefer to add that to the block list here for consistency with the other parts. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's right. I got caught up thinking _yield_zones()
will go through the zones. Added now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this! LGTM.
Fixes #1797.
I've tried a few days and finally got a set_trace inside
need_ray_up()
, tested that there.search()
there works. This doesn't test things end-to-end, however.Tested (run the relevant ones):
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh